Measuring Network Utilisation at the University of Melbourne
[ascii version]
Douglas Ray
doug@unimelb.edu.au
Information Technology Services
University of Melbourne
Copyright 1993 University of Melbourne
--------------------------------------------------------------------------
0 context
1 task
2 network topology
3 sources of info
4 algorithm and constraints
5 NNStat and constraints
6 implementation architecture
7 performance considerations
8 ancillary info
9 conclusions
--------------------------------------------------------------------------
0 context
This paper explores a specific project in network usage monitoring,
currently in use at the University of Melbourne. The project was
partly prompted by the perception that at some stage charging for
external network traffic may become necessary within the University.
"Usage" is a fiddly thing to define, because unless you do protocol
emulation on every connection you can't always determine which side
of a connection initiated a data transfer. This paper does not
address that problem.
To date, the possible mode of charge has been assumed to be a measure
of volume usage, possibly weighted by type-of-service. Byte and
packet measures have been derived, and per-service figures are kept.
Within this limited frame there are many possible strategies. Links
have fixed bandwidth, routers have upper limits on packet switching, so
depending on the degree of saturation, and whether a route is saturated
at link or at gateway, different weightings of byte and packet measures
may be applicable - *eg*, straight byte-based accounting for saturated
links, and a combination of byte and packet accounting for
non-saturated links and router bottlenecks.
However, this raises the question of what charging is trying to achieve.
The rationale behind the above example is that charges
should directly reflect the cost of the infrastructure (links and
routers) being used. Further, the implication is that charging is a
means of recouping costs. This ignores the utility of charging in
managing usage patterns.
If you're trying to curtail usage of saturated links, there's not much
point putting a financial premium on their use unless the person being
charged knows when they are incurring the premium. To influence such
behaviour it is more effective - not to mention much simpler - to
specify peak usage periods, and charge a premium on usage in those
periods.
These, and other strategies, and references, can be found in Roger
Clarke's paper for this conference. My concern is not so much the
charging strategy as the method of measuring usage. The foregoing
discussion serves to emphasise that "usage" has as many meanings as one
chooses to define. The definition we're after is the one which
saves us money.
--------------------------------------------------------------------------
1 task
The task I was presented with was to measure and characterise types of
traffic between parts of the University and the rest of Victoria, greater
Australia, and overseas. (By "characterisation" of traffic I mean
measuring various types of services used - telnet, mail, ftp, etc.)
It quickly became apparent that it would be useful to log a fourth class
separately: traffic with the central AARNet servers.
Only IP traffic is considered. This forms the bulk of traffic exiting
the campus network.
"Parts of the University" was clarified to be subnets. This may need to
be refined in the light of possible charging applications. If we assume
we want to generate bills at a departmental or faculty level, there are
a few problems to solve.
A) Departments with mixed subnets
This is simple - add the subnets - and we probably need to do this
anyway. However, there is a substantial chunk of work in adding the
further level of abstraction, department, on top of the existing
subnet-based processing.
B) Subnets with mixed departments
This is the killer. The best solution is not to mix departments on a
subnet, but given our current subnet usage that would be impractical.
Also, there are some departments that naturally have hosts dotted all
over campus; eg, the library. All solutions, other than segregation of
departments at the subnet level, require changing the granularity of
local peer from subnet to host, and collating the host info by
department. (For a note on the practicality of this, see section
7). Even if we presume that performance limits would allow this, we
also need procedures which enable automatic updates of our stats
monitor config whenever new hosts appear, move between departments, or
when departments move between subnets.
One solution is to collate IP's by DNS zone name. This presumes that
the zone names in the DNS adequately reflect departmental
structure for billing purposes.
If they don't, then a separate table of IP number/department pairs is
needed (or hostname/department pairs, with the IP numbers found by DNS query).
This assumes that the list will be maintained - which requires extra
work on updates - or that the DNS files will be automatically
regenerated from the list. It also assumes that those departments
running their own nameservers remember to forward information of new
hosts to us... information for which they will be billed.
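As a hedged illustration of the zone-name approach, the sketch below maps
an address to a department via reverse DNS. It assumes the reverse zones
are populated and that the second label of the hostname names the
department; the example hostname is hypothetical.

    import socket

    def department_of(ip):
        """Map an IP number to a department via its DNS zone (illustrative)."""
        try:
            hostname, _aliases, _addrs = socket.gethostbyaddr(ip)
        except OSError:
            return "unknown"              # no reverse record registered
        labels = hostname.split(".")
        # eg "somehost.csse.unimelb.edu.au" -> department zone "csse"
        return labels[1] if len(labels) > 2 else "unknown"
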
C) Subnets with shared resources
As an example, we have various hosts in lecture theatres, for which
someone would have to work out a billing system.
--------------------------------------------------------------------------
2 network topology
We measure this traffic on the following network.
[FIGURE]
The key points of this topology are as follows. The AARNet spine has
* servers
* a link to Internet via the USA
* links to other states via the national router
* links to Victorian institutions via the Victorian router
One of these latter links connects to the University of Melbourne's
ethernet spine. The UoM ether spine supports two routers making
redundant connections to the main FDDI ring around campus. (There are
a further 6 cisco routers on the FDDI ring.)
The University is primarily a class B network using a subnet mask of
255.255.255.0. Only a handful of subnet numbers are not in use.
Variable length subnet masks have recently been enabled, but the
implications of this for our stats reporting methodology are not
discussed here.
--------------------------------------------------------------------------
3 sources of info (on campus)
3.1.1 cisco routers
3.1.2 stats servers
3.1.2.1 departmental stats servers
3.1.2.2 central stats server
3.2 geographic info
3.1.1 cisco routers
The cisco routers can be configured for IP logging. This gives IP source
and destination pairs, with packet and byte counts. The drawbacks of
this approach are -
* can't distinguish type-of-service.
(for this we'd need the port numbers)
* no control of granularity: must collate info for each host-host
pair (not to mention download the info across the network)
* load on the routers
(IP logging imposes a substantial CPU overhead on ciscos)
* doesn't give geographic location of peer.
3.1.2 stats servers
Dedicated stats servers have several advantages over router accounting.
Using generic packet filtering, we can select by
* fields in the ethernet header
* fields in the IP header
* derived info ("virtual fields": eg, network and subnet)
This makes it easy to differentiate type-of-service. Dedicated stats
servers impose little or no performance overhead on network load.
(We find this functionality in the NSFnet software package, NNStat,
which is used on the AARNet stats server *vovo* and on our stats
server.)
If we use stats servers, where do we put them?
3.1.2.1 departmental stats servers
A dedicated stats server on each subnet would have the advantage of
measuring real traffic on the subnets, rather than just traffic
crossing a cisco interface.
However, though this could be useful for maintenance and load
prediction, it is not part of the task under discussion. It is also
particularly expensive, both in equipment and configuration time.
3.1.2.2 central stats server
A single stats server on the UoM ether spine would see each packet entering
or exiting the University. Results can be processed locally on the
server rather than being shipped across the network, and the presence of
the server doesn't affect network performance. This is the approach
we've taken.
3.2 geographic info
Remember, our task is to separate Victorian, Australian and international
traffic. How do we get the geographic location of the peer?
Traffic crossing the UoM ether spine will have an ethernet address (either
source or destination) of the victorian gateway. This differentiates the
traffic we want from any other stuff, but still doesn't give us the geographic
info we need.
The only other source of information at the UoM ether spine is the IP
address of the peer. That means we need a list of network numbers
and their geographic location. What range of network numbers do we
wish to cover? We don't need (or want) every network in the world.
If we had, separately, all Victorian networks and all other Australian
networks, then anything else can be assumed to be overseas. (But all
we really require are those Victorian and Australian nets that talk
with us.)
We might consider traceroute (but not for long :). Even with a maximum
hop count of two, using traceroute to determine the location of IP
numbers would be an excellent way of consuming system resources. At
one extreme one traces every address not currently known, recording
Victorian or Australian networks in lists - then each overseas address
generates a trace whenever it occurs. T'other extreme, we maintain a
list of overseas addresses as well... but that list would get rather
large. Okay, well, we can *cache* the overseas addresses... but one
starts to sense it could be worth looking for a simpler solution.
There's a file which contains details of all Australian IP networks, on
munnari.oz.au: "netinfo/status". Unfortunately, although the
network numbers and network names are kept up to date, the geographic
information is not complete.
Traffic crossing the AARNet spine has an ethernet peer address of the
Victorian gateway, the national gateway, or the usa gateway (for
simplicity we'll ignore the fiji gateway). This is where we can
distinguish our target geographic classes. Luckily, the AARNet stats
server (vovo) already does this, so we can get lists of Australian and
Victorian IP networks from there. We download the files listing
Victorian networks and their Australian network connection peers.
This is an adequate approximation of the required "Victorian and
Australian nets that talk with us", ignoring any Victorian nets that
talk exclusively with us.
--------------------------------------------------------------------------
4 algorithm and constraints
4.1 algorithm
4.2 constraints of the algorithm
4.1 algorithm
Here we discuss the algorithm used for sorting and collating the traffic.
As mentioned above, the common factor in all traffic we're interested in
is that, when crossing the UoM ether spine, it has either a source or
destination ethernet address of the victorian gateway. This gives us
the outer tests of our algorithm. Given the lists of network numbers
we can easily derive the basic structure, which uses eight sets of tests
to record traffic in and out of each of our four target locations:
if ether source = vic.gw
then {
    if IP source network is in VICnets
        record in incoming-traffic-from-vic
    else if IP source subnet is in AARNspine
        record in incoming-traffic-from-aarn
    else if IP source network is in AUSnets
        record in incoming-traffic-from-aus
    else
        record in incoming-traffic-from-os
}
else if ether destination = vic.gw
then {
    if IP destination network is in VICnets
        record in outgoing-traffic-to-vic
    else if IP destination subnet is in AARNspine
        record in outgoing-traffic-to-aarn
    else if IP destination network is in AUSnets
        record in outgoing-traffic-to-aus
    else
        record in outgoing-traffic-to-os
}
From here, the pseudocode "record" must be expanded to a block of tests
which logs at the particular level of granularity required. (This
will be discussed later: see section 6.2.1.) In our context this means a
series of tests to distinguish port information, and special tests for
particular servers and subnets.
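For concreteness, the outer tests can be restated in Python roughly as
follows. This is a hedged sketch rather than production code: the gateway
address is a placeholder, and the three sets are assumed to be filled from
the vovo-derived network lists.

    VIC_GW_ETHER = "aa:00:04:00:00:01"   # hypothetical ethernet address of vic.gw
    VICNETS, AUSNETS, AARN_SPINE = set(), set(), set()  # loaded from vovo lists

    def classify(ether_src, ether_dst, src_net, src_subnet, dst_net, dst_subnet):
        """Return the object name a packet should be recorded under."""
        if ether_src == VIC_GW_ETHER:                 # traffic entering campus
            if src_net in VICNETS:
                return "incoming-traffic-from-vic"
            if src_subnet in AARN_SPINE:
                return "incoming-traffic-from-aarn"
            if src_net in AUSNETS:
                return "incoming-traffic-from-aus"
            return "incoming-traffic-from-os"
        if ether_dst == VIC_GW_ETHER:                 # traffic leaving campus
            if dst_net in VICNETS:
                return "outgoing-traffic-to-vic"
            if dst_subnet in AARN_SPINE:
                return "outgoing-traffic-to-aarn"
            if dst_net in AUSNETS:
                return "outgoing-traffic-to-aus"
            return "outgoing-traffic-to-os"
        return None                                   # not crossing vic.gw
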
4.2 constraints of the algorithm
Traffic with addresses not registered in our lists of Victorian and
Australian networks will be wrongly attributed to overseas traffic.
Also, networks are dynamic. More arrive continually, on something
between a daily and weekly basis. To address these points we need
timely updates of the network lists.
Unavoidably, we also need to rebuild the config of our network
monitoring software on a daily basis, to include updated network lists
in an automated and robust fashion. (Or, rather, we need to check
to see if it needs rebuilding.)
While doing all this it would be prudent to log changes of config with
detail sufficient that we can reverse any problems. If someone
broadcasts a bogus net we don't want it on our server forever.
--------------------------------------------------------------------------
5 NNStat
5.1 Intro
5.2 Constraints of NNStat
5.2.1 single statspy per server
5.2.2 polling rewrite bug
5.2.3 NNStat config limits
5.2.4 subnet bug
5.2.5 "select" bug?
5.2.6 separate configs for statspy and collect
5.1 Intro
These comments apply to version 3.2 of NNStat, which was the current
version at the start of 1993. A beta of version 3.3 was released
recently.
NNStat is distributed as source code for Sun and DECstation
platforms. It relies on a promiscuous-mode network interface, using
the NIT device on SunOS and the packetfilter option on ULTRIX. Two
daemons do most of the work. Statspy, the monitor process, scans
packets available on the network interface and matches them against a
pattern config to control various counting operations. Collect, the
logger process, periodically interrogates the statspy monitor and logs
the information in files. A third tool, rspy, allows interactive
query and control of the statspy monitor. (Rspy is only used for
diagnostic purposes in our system.)
Counting operations are object-based, and the collect logger can be
told to look for all objects or given subsets of objects. Checkpoint
and logging periods are constant for a given collect process; using
several collect processes, one can log various objects at different
temporal resolutions.
Three periods - the polling, checkpoint and clear intervals - can be
set independently for each collect process. At each polling period,
collect gathers totals for its chosen objects from statspy, and logs
them to files. Subsequent polls overwrite the previously logged
record, until the checkpoint period expires, at which time the next
record to be logged is appended to the file. Totals between
checkpoints (and polls) are cumulative until the clear period expires,
at which point all counters are reset.
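As a toy illustration of these semantics (invented traffic figures, not
NNStat code), consider a collect process with a 5 minute poll, 15 minute
checkpoint and 24 hour clear, as used for our more frequently sampled
objects:

    POLL, CHECKPOINT, CLEAR = 5, 15, 24 * 60      # minutes

    def simulate(minutes, pkts_per_poll=100):
        log = []        # the log file: one entry per appended record
        counter = 0     # statspy's cumulative count for one object
        for t in range(POLL, minutes + 1, POLL):
            counter += pkts_per_poll              # traffic seen since last poll
            if t % CHECKPOINT == POLL % CHECKPOINT:
                log.append((t, counter))          # first poll after a checkpoint
            else:
                log[-1] = (t, counter)            # overwrite the previous record
            if t % CLEAR == 0:
                counter = 0                       # clear interval: reset counters
        return log

    # simulate(45) leaves three records: (15, 300), (30, 600), (45, 900)
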
5.2 Constraints of NNStat
A number of constraints are imposed by the current implementation of NNStat.
5.2.1 single statspy per server
One can only have one monitor process - and one active config - per server.
Because of this we can't test configs on the production server without
interrupting stats collection. We either suffer downtime or use a
developmental server.
5.2.2 polling rewrite bug
The method NNStat uses for the collect processes to log info to files
is buggy: logs sometimes acquire inter-record garbage. Because of
this the log files must be parsed with some non-intuitive tests during
postprocessing.
5.2.3 NNStat config limits
There are various fixed limits in the NNStat code, some of which aren't
documented. These limits can be changed by recompiling the source, but
any given config must work within the limits of the currently installed
executables. The limits which we know of are:
maximum number of objects for collector
maximum number of cases in a "select" statement
maximum number of parameters for object class definitions
These limits have all been raised in our installation. Some config
overflow errors are silently ignored, some trigger core dumps - either
way, we must check that they aren't exceeded when building the config.
5.2.4 subnet bug
There is a bug in 3.2 preventing access to subnet information on
little-endian architectures (eg, DECstation). A patch for this was
found by Gavin Stone-Tolcher (UQ).
5.2.5 "select" bug?
There may be a bug in the number of cases recognised by the "select"
statement - either that or a confusion of datatypes has led to only
half the specified maximum being usable. This will be confirmed when
our developmental server is configured.
5.2.6 separate configs for statspy and collect
The monitor process and logging processes look at separate config files
for essentially the same information. This marginally complicates building
the configuration.
--------------------------------------------------------------------------
6 implementation architecture
6.1 Server Platform
6.2 Software
6.2.1 Config Builder
6.2.2 NNStat Traffic Logger
6.2.3 Postprocessing
6.1 Server Platform
The production server, noc, is a DECstation 5000/250 with 64M ram and
2G disk, running ULTRIX 4.3. (This is not an endorsement of DEC.)
While noc was being coaxed to sporadic life, the project was supported
on a Solbourne S4000 with 40M ram and 1G disk, running OS/MP 4.1a.3
(licensed SunOS, ~4.1.2).
6.2 Software
The present system is cobbled together from scripts. With marginal
editing of reality, it can be divided into the four functional units
shown in figure XXXX. (Most of the NNStat traffic logger and a good
portion of the postprocessing scripts were donated by Robert Elz, from
his work on the AARNet stats server.)
[FIGURE]
My aim is to automate the procedure as far as possible. Human
intervention should only be required for qualitative evaluations,
during the report generation phase. Currently, both the config
creation and traffic logging run untended. Postprocessing is still
initiated manually, pending rationalisation of error reporting and
satisfactory error handling. Report generation facilities are
meagre.
6.2.1 Config Builder
The config builder updates the NNStat config with new Victorian and
other Australian network numbers. It sifts through the network lists
obtained from vovo, and when new numbers are found, incorporates them
into the config. It maintains version details for the config, logs
which vovo file the new numbers came from, and logs when a new config
is installed. Figure XXXX gives a (rather simplified) view of the
process.
There are three main parts in the code. First there's a module
that manages the network numbers. This is called nightly from cron.
When new numbers are found, it calls the config builder module, which
reassembles the NNStat config from various schemas. Finally, if the
config was rebuilt successfully, the config builder calls a
module which installs the config.
[FIGURE]
Integrated within this code are checks to ensure that a new config
doesn't exceed NNStat's limits (insofar as these are known). If these
are exceeded, the new config will not be installed, future config
updates will be suspended, and mail sent to the administrator. Before
this happens, when the config expands beyond certain thresholds,
warning mail is sent to the administrator. This will allow a new set of
executables to be compiled and installed before config updates are
interrupted.
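A much-simplified sketch of this nightly driver follows. The function
bodies, limit value and warning threshold are hypothetical stand-ins for
the real scripts; only the control flow is meant to mirror what is
described above.

    MAX_SELECT_CASES = 128          # assumed compiled-in NNStat limit
    WARN_FRACTION = 0.9             # warn when a config nears the limit

    def notify_admin(msg):
        print("MAIL TO ADMIN:", msg)                 # stand-in for sending mail

    def build_config_from_schemas(nets):
        """Expand the config schemas for a set of network numbers (stub)."""
        cases = sorted(nets)
        return "\n".join("case ( %s )" % n for n in cases), len(cases)

    def install_config(text):
        print("installing new config; version and source logged")   # stand-in

    def nightly_update(known_nets, vovo_nets):
        """Called from cron; returns the set of networks now in the config."""
        added = vovo_nets - known_nets
        if not added:
            return known_nets               # nothing new tonight
        config, n_cases = build_config_from_schemas(known_nets | added)
        if n_cases > MAX_SELECT_CASES:
            notify_admin("new config exceeds NNStat limits; updates suspended")
            return known_nets               # keep running the old config
        if n_cases > WARN_FRACTION * MAX_SELECT_CASES:
            notify_admin("config nearing NNStat limits; recompile executables")
        install_config(config)
        return known_nets | added
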
The main schema for the config implements the algorithm discussed
earlier (see section 4.1). The blocks of tests referred to record
subnet and type-of-service, and, for the most part, have a common
structure. Accordingly, we generate all eight blocks by expansion of a
single schema of subnet tests - this simplifies modifications if we
need to change the level of detail of information logged.
The block which records traffic into the University from Australian
networks has the following form. First, the total for all incoming
ethernet traffic is logged. Traffic destined for hosts on the UoM
ether spine (excluding the fddi routers) is recorded, and then we deal
with traffic heading for the FDDI ring. For traffic bound for the
FDDI ring, one exception is handled first: munnari. We keep track of
munnari's traffic separately because it supplies national services. All IP
traffic is logged, and then, separately, TCP traffic and UDP traffic.
Source and destination ports are logged separately for TCP and UDP.
Having dealt with munnari, we handle the rest of the University
similarly, logging first IP traffic totals, then subtotals for TCP
matched on source port, TCP matched on destination port, UDP matched on
source port and UDP matched on destination port. The destination
subnet is recorded, to give some idea of which parts of the University
are in communication.
# traffic into campus from ausnets
record destination-subnet
select Ether.destination {
    case ( HOSTS-ON-UoM-ETHER-SPINE )
        record Ether.destination
    case ( "rb1.rtr", "rb2.rtr" ): {
        # to campus on FDDI ring:
        if IP.destination is munnari {
            record IP.destination
            if TCP.destinationport is in TCP-PORTS-LIST
                record TCP.destinationport
            if TCP.sourceport is in TCP-PORTS-LIST
                record TCP.sourceport
            if UDP.destinationport is in UDP-PORTS-LIST
                record UDP.destinationport
            if UDP.sourceport is in UDP-PORTS-LIST
                record UDP.sourceport
        }
        # All subnets (including munnari traffic):
        record IP.destination-subnet
        if TCP.destinationport is in TCP-PORTS-LIST
            record TCP.destinationport, IP.destination-subnet
        if TCP.sourceport is in TCP-PORTS-LIST
            record TCP.sourceport, IP.destination-subnet
        if UDP.destinationport is in UDP-PORTS-LIST
            record UDP.destinationport, IP.destination-subnet
        if UDP.sourceport is in UDP-PORTS-LIST
            record UDP.sourceport, IP.destination-subnet
    }
    default : { # shouldn't be anything in here.
        record Ether.destination, Ether.source
        record IP.destination, IP.source
    }
} # end select Ether.dst
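The eight blocks of this kind come from expanding one schema per
direction/region combination. A minimal sketch of such an expansion (the
template text and names here are illustrative, not the real schema) might
look like:

    DIRECTIONS = [("into", "from", "destination"), ("out of", "to", "source")]
    REGIONS = ["vic", "aarn", "aus", "os"]

    SCHEMA = ("# traffic %(dirn)s campus %(prep)s %(region)snets\n"
              "record %(end)s-subnet\n"
              "... subnet and port tests, as in the block shown above ...\n")

    def expand_all():
        """Return the eight blocks, one per direction/region pair."""
        blocks = []
        for dirn, prep, end in DIRECTIONS:
            for region in REGIONS:
                blocks.append(SCHEMA % {"dirn": dirn, "prep": prep,
                                        "region": region, "end": end})
        return blocks
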
6.2.2 NNStat Traffic Logger
The traffic logger is shown (almost verbatim) in Fig. XXXX.
The system is started at boot time from rc.local, and restarted at
midnight from cron. There is a script to control starting the NNStat
processes (restart.noc) and another which stops them (stop.noc).
[FIGURE]
The collect processes log information in files named after the object being
logged, one file per object, and given a suffix indicating the time at
which the collect process was invoked. The timestamp causes the
collect processes to log to different files (unless processes are
started within the same minute).
We choose to stop and restart the statspy and collect processes daily.
This means that we get a new set of log files each day, which
simplifies expunging corrupted logs from the data. It also makes it
easy to install a new config when one is required. (We could manually
update the runtime config on statspy via rspy, but as we have to
generate the new config anyway, and we have an excuse for stopping the
logging, there would be no point in such sophistication.)
Process IDs are recorded and timestamped whenever the processes are
stopped or restarted, and a comment can be automatically inserted in
the log when the scripts are run manually. When statspy is started
it gives a commentary on what it thinks of your config. This
diagnostic message is superficially parsed by the start-up script
(restart.noc) to detect key words ("error", "warning"), and when
problems are detected mail is sent to the administrator.
One action not represented on the diagram is the weekly transfer of the
traffic logs. Once a week, between stopping and restarting the NNStat
processes, the *daily* script renames the current logging directory
with a timestamp and recreates the logging directory. (Our reports are
based on weekly aggregates).
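The weekly transfer itself amounts to something like the sketch below
(the path is hypothetical; in practice it is a couple of lines in the
*daily* shell script):

    import os
    import time

    LOGDIR = "/usr/local/nnstat/logs"        # assumed logging directory

    def weekly_rotate():
        """Run between stopping and restarting the NNStat processes."""
        stamp = time.strftime("%Y%m%d")
        os.rename(LOGDIR, LOGDIR + "." + stamp)   # stash the week's logs
        os.mkdir(LOGDIR)                          # fresh directory for the new week
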
6.2.3 Postprocessing
6.2.3.1 Input
6.2.3.2 Output
6.2.3.3 Reconciliation
6.2.3.4 Geographic totals pipeline
6.2.3.5 Type-of-Service totals pipeline
Under postprocessing we subsume the evils that happen after the week's
data has been logged. The processing modules of this section form a
number of data pipelines applicable to different groups of objects.
6.2.3.1 Input
The input - the raw traffic log files from the NNStat logger - consists
of byte and packet totals sampled cumulatively over each day for a series of
"data objects". We choose to generate weekly totals. The records for
a given object will be distributed over a number of files, a new file
for each time the collect processes are restarted.
The objects we log can be divided into two groups. Most are sampled at
30 minute intervals and checkpointed only at the 24h mark. A smaller
group are sampled at 5 minute intervals and checkpointed every 15
minutes. The latter are the IP logs used for generating total traffic
figures for each target geographical destination, and contain subtotals
for each subnet.
Below is a sample NNStat record for information coming into campus from
Australian networks. (Local subnet and subtotals have been
overwritten for mystique). The first figure after the subnet key is
the number of packets, and the second figure the number of bytes.
"Total Count" sums the packet information.
OBJECT: F.IN.dstsubn-all.from-ausnets.ip Class= freq-all-bytes [Created: 00:00:34 11-21-93]
ReadTime: 00:00:00 11-22-93, ClearTime: 00:00:40 11-21-93 (@-86360sec)
Total Count= 649370 (+0 orphans)
Total Bytes= 48370339B #bins = 43
[128.250.aaa.0]= 999999 &99999999B (75.6%) @-0sec
[128.250.bbb.0]= 88888 &8888888B ( 6.9%) @-17sec
[128.250.ccc.0]= 77777 &7777777B ( 4.9%) @-0sec
[128.250.ddd.0]= 66666 &6666666B ( 2.4%) @-0sec
[128.250.eee.0]= 55555 &5555555B ( 2.2%) @-11sec
[128.250.fff.0]= 44444 &444444B ( 1.8%) @-7sec
...
The former, less frequently sampled group contains objects logging
type-of-service details for TCP and UDP ports for each subnet. Below
is a sample recording TCP source-port details for traffic from
Australian networks into campus. (Local subnet and subtotals have been
overwritten from boredom).
OBJECT: S.IN.dstsubn.from-aus.tcp.srcports Class= matrix-all-bytes [Created: 00:00:34 11-21-93]
ReadTime: 00:00:00 11-22-93, ClearTime: 00:00:41 11-21-93 (@-86359sec)
Total Count= 412647 (+0 orphans)
Total Bytes= 25095824B #bins = 45
[119 "NNTP" : 128.250.aaa.0]= 999999 &99999999B (90.8%) @-0sec
[513 "rlogin|rwho" : 128.250.bbb.0]= 88888 &888888B ( 2.7%) @-21797sec
[23 "Telnet" : 128.250.ccc.0]= 7777 &7777777B ( 1.6%) @-21090sec
[25 "SMTP" : 128.250.ddd.0]= 6666 &666666B ( 1.2%) @-21sec
[20 "FTP data" : 128.250.eee.0]= 5555 &5555555B ( 1.0%) @-7766sec
[20 "FTP data" : 128.250.fff.0]= 4444 &4444444B ( 1.0%) @-12179sec
[79 "Finger" : 128.250.ggg.0]= 3333 &333333B ( 0.4%) @-6410sec
...
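A hedged sketch of parsing records of this form is given below. The
patterns match only the layout shown in the samples; real logs
additionally need the garbage-skipping tests mentioned in section 5.2.2.

    import re

    BIN_RE = re.compile(r"^\[(?P<key>[^\]]+)\]=\s*(?P<pkts>\d+)\s*&(?P<bytes>\d+)B")

    def parse_record(lines):
        """Parse one NNStat record (a list of its text lines) into a dict."""
        rec = {"bins": []}
        for line in lines:
            if line.startswith("OBJECT:"):
                rec["object"] = line.split()[1]
            elif line.startswith("Total Count="):
                rec["packets"] = int(line.split("=")[1].split()[0])
            elif line.startswith("Total Bytes="):
                rec["bytes"] = int(line.split("=")[1].split()[0].rstrip("B"))
            else:
                m = BIN_RE.match(line)
                if m:
                    rec["bins"].append((m.group("key").strip(),
                                        int(m.group("pkts")),
                                        int(m.group("bytes"))))
        return rec
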
6.2.3.2 Output
For each object we generate a single NNStat-style record summarising the
week's traffic.
6.2.3.3 Reconciliation
As shown above, each NNStat record contains a header of summary info,
and a table of values. Whenever our postprocessing modules read or
write an NNStat record, they verify that the byte and packet totals in
the header are still within a small percentage of the actual sums of
the table. (Anomalies are reported and can be detected by simple string
searches of the postprocessing log.) This simple reconciliation
procedure has detected a number of instances where corrupted data
slipped past the initial log parsing. However, it is probably not
foolproof. It would be safer if further reconciliation phases were
introduced within the pipeline. A good target would be ensuring that
the sum of IP traffic is not less than the sum of all TCP and UDP
services.
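In outline, the check is no more than the following; the tolerance value
is an assumption, not the figure used in the production scripts, and
records are as returned by the parser sketch above.

    TOLERANCE = 0.01       # assumed: header may differ from table sums by 1%

    def reconcile(rec):
        """Report an anomaly if header totals drift from the table sums."""
        pkt_sum = sum(p for _key, p, _b in rec["bins"])
        byte_sum = sum(b for _key, _p, b in rec["bins"])
        ok = True
        for label, header, table in (("packets", rec["packets"], pkt_sum),
                                     ("bytes", rec["bytes"], byte_sum)):
            if header and abs(header - table) / header > TOLERANCE:
                print("RECONCILE ANOMALY %s: %s header=%d table=%d"
                      % (rec["object"], label, header, table))
                ok = False
        return ok
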
6.2.3.4 Geographic totals pipeline
Processing the simple sums of IP traffic to and from the geographic target
areas is conceptually straightforward.
All the records relevant to a given object are collated and the
effects of any restarts or clears between records are removed. The
last record is then the accumulation of the week's data, so this record
is extracted. The results are sorted, with separate tables being
produced for packet and byte measures.
This forms our basic pipeline.
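Sketched in Python (assuming records parsed as above, and that the final
record of each logging run is kept so restarts and clears do not lose
data), the pipeline reduces to roughly:

    from collections import defaultdict

    def weekly_totals(final_records):
        """Sum the week's final records for one object; return sorted tables."""
        pkts = defaultdict(int)
        bytes_ = defaultdict(int)
        for rec in final_records:
            for key, p, b in rec["bins"]:
                pkts[key] += p
                bytes_[key] += b
        by_pkts = sorted(pkts.items(), key=lambda kv: kv[1], reverse=True)
        by_bytes = sorted(bytes_.items(), key=lambda kv: kv[1], reverse=True)
        return by_pkts, by_bytes
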
6.2.3.5 Type-of-Service totals pipeline
A number of extra modules must be inserted into the above unit to
handle type-of-service data. Information collected for source and
destination ports must be combined. Most protocols can be adequately
handled in one of two ways, depending on whether they establish a
connection with a well-known port number at one end (sum source and
destination values) or both ends (take the maximum of source and
destination values). At this point we have possibly several entries
for a single subnet, so we must sum matching subnet/port combinations.
Various services use multiple ports, so we have another module to sum
these. Finally, TCP and UDP information is combined, and then the
sorting modules are invoked.
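The source/destination port combination can be sketched as follows; which
services are treated as having well-known ports at both ends is an
assumption here, with NNTP used purely as an illustration.

    BOTH_ENDS_WELL_KNOWN = {119}      # eg NNTP server-to-server feeds (assumed)

    def combine_ports(port, src_count, dst_count):
        """Merge counts matched on source port and on destination port."""
        if port in BOTH_ENDS_WELL_KNOWN:
            # the same traffic is seen from both directions: don't double-count
            return max(src_count, dst_count)
        # well-known port at one end only: the two measures are disjoint
        return src_count + dst_count
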
--------------------------------------------------------------------------
7 Performance Considerations
On our platforms the NNStat processes appear to be CPU-limited.
The figures below are from the production server, a DECstation 5000/250.
They describe performance under the current config, which sets the
resolution of local peer to be at the level of subnet.
The current config logs something over 30M of files per week. The
files produced by postprocessing are less than 2M. This is composed
of 541 objects checkpointed daily, and 144 objects checkpointed at
15 minute intervals. The latter form more than 90% of the bulk.
Any time-based accounting could substantially increase disk usage.
For the collect processes, size in virtual memory and the resident set
size appears to be static and easily manageable. CPU time is
dependent on the number of objects and amount of information per
object, but the total CPU time expended over a 24h period is quite
modest.
USER PID %CPU %MEM SZ RSS TT STAT TIME COMMAND
ray 27256 0.0 0.4 1188 200 ? S 0:00 collect -m 644 ...
ray 27248 0.0 0.4 1188 220 ? S 0:28 collect -m 644 ...
ray 27240 0.0 0.4 1188 220 ? S 0:28 collect -m 644 ...
ray 27231 0.0 0.4 1188 216 ? I 1:19 collect -m 644 ...
Of the above processes, the first is collecting information for one object.
The middle two are each collecting information on 270 objects. The lower
one is collecting information on 144 objects polled at 5 minute intervals;
the upper three are polling statspy at 30 minute intervals.
The statspy process consumes appreciable CPU time, and expands
gradually throughout the logging period. Memory will probably not
become a problem, given that we're restarting the process every 24h -
the process size starts at less than 2M (the snapshot below was taken
60 seconds after invocation):
USER PID %CPU %MEM SZ RSS TT STAT TIME COMMAND
ray 7495 0.4 2.6 1952 1448 ? S 0:11 statspy parm.noc
and on an average day grew to something over 2M:
USER PID %CPU %MEM SZ RSS TT STAT TIME COMMAND
ray 27219 0.0 3.1 2252 1712 ? S N 149:33 statspy parm.noc
However, it consumed some 9000 seconds of CPU time in the period.
Process accounting has shown this to be the mode, but there are
occasional peaks of 25,000 seconds, which is a substantial portion of
the 86,400 seconds of one day. (It would be interesting to see what
memory consumption had got to on those days, but we've only just
started logging this.)
There are three things statspy might be spending CPU time on: incrementing
counters; working out which counter to increment; or downloading
figures to the collect process.
We haven't observed a marked increase in statspy's CPU consumption
during the periodic polls by the collect processes. This implies that
CPU usage basically goes to processing traffic and, we presume, is
dominated by the complexity of pattern matching encoded in the config.
If this is the case then, at peak traffic loads, our current server
probably won't cope with a substantially more complicated config.
We have been using the ULTRIX packet filter with the default maximum
queue of 32 packets (NNStat automatically requests maximum queue
length). (We haven't checked what diagnostics we'd receive from either
NNStat or the kernel in the event of an overflow, but will investigate
this as soon as the development server is configured.)
As mooted in the initial discussion of our task (see section 1), subnet
resolution may not be sufficient for some applications. Converting
the config from the subnet paradigm to a departmental paradigm - moving
the peer resolution towards individual hosts - dramatically increases
the complexity of the config, in the worst case by more than two orders
of magnitude. Clearly we can't ask CPU time to increase by this
factor. (Nor, for that matter, disk usage, but we can collate host
info without logging it to file.)
It remains to be seen whether some intermediate solution is plausible.
Answering this reliably requires determining the CPU cost factors of
various formats of config. Note that the actual CPU usage under any
config depends on the profile and quantity of traffic. If reliable
estimates of CPU costs are determined, then the worst case traffic
profiles and maximum traffic loads that a stats server will cope with
can be predicted. (Which could be put another way - that any stats
server is only reliable up to a given load of traffic, and it would
probably be useful to know what that load is :)
--------------------------------------------------------------------------
8 ancillary info
8.1 time
8.2 system downtime
8.3 anomaly tagging
8.4 config version tagging
Here are things we should be doing but aren't, or are still developing.
8.1 time
A network stats server must have a reliable time source, particularly
if the information collected is used to generate bills. Installing the
NTP suite is a suitable solution.
8.2 system downtime
What we miss is as important as what we get. Every report must note
server downtime, and where possible estimate the effect of this on
recorded results. Even merely registering downtime is a messy problem.
We identify three components: downtime due to
* platform failure (OS or hardware)
* stats system failure (s/w error or resource privation)
* interruptions in network connectivity
Each of these requires its own method of monitoring and logging.
8.3 anomaly tagging
There are a range of situations that are useful to keep in mind when
viewing stats - eg, interruptions to logging, network storms, partial
network failures, connection of new services, and changes in filtering
policy. This information tends not to be collated anywhere. It is
useful not only to log these events consistently, but to have some way
of tagging potentially anomalous data with cross-references to the
related events.
8.4 config version tagging
Unfortunately, different postprocessing functionality is sometimes
required in different config versions. Most simply, when new objects
are logged, they must be processed. Changes in the config must be
logged. For any project that will last longer than a few months it
would be worth considering using the config version to tag the data,
and enabling automatic selection of an appropriate postprocessing
regime.
--------------------------------------------------------------------------
9 conclusions
Nothing is as simple as it seems. Even given perfect network stats
software, some ancillary information is required. As far as possible,
inclusion of ancillary information should be automated, as it
is inordinately fiddly and time-consuming to collate manually.
It has proved to be practical to log type-of-service information at the
resolution of the local subnet for a class B network with class C netmask.
Subnet-level resolution is probably insufficient for billing purposes,
particularly if billing at a departmental level rather than faculty
level.
It is not known whether increasing resolution to the level of local hosts
is practical on the current server platform. This can be investigated
when the developmental server is configured.
The primary restrictions on expanding the stats system are expected to
be CPU time and disk space.